The natural gradient, as introduced by [Amari, 1987], allows for more efficient
gradient descent by removing dependencies and biases inherent in a function's
parameterization. Several papers present the topic thoroughly and precisely
[Amari, 1987, Amari, 1998, Amari and Nagaoka, 2000, Theis, 2005, Amari, 2010].
It remains a very difficult idea to get your head around, however. The intent of
this note is to provide simple intuition for the natural gradient and its uses. We
review how an ill-conditioned parameter space can undermine learning, introduce
the natural gradient by analogy to the more widely understood concept of
signal whitening, and present tricks and specific prescriptions for applying the
natural gradient to learning problems. To our knowledge, this is the first time a
connection has been made between signal whitening and the natural gradient.

1 Natural gradient 
1.1 A simple example 

We begin with a simple probabilistic model which has clearly been very poorly
parametrized. For this we use a two dimensional Gaussian distribution
$q(x;\theta)$, with means written in terms of the parameters
$\theta \in \mathbb{R}^2$.

As an objective function $J(\theta)$ we use the negative log likelihood of
$q(x;\theta)$ under an observed data distribution $p(x)$,

$$J(\theta) = -\langle \log q(x;\theta) \rangle_{p(x)}. \tag{2}$$
Using steepest gradient descent to minimize the negative log likelihood involves
taking steps like

\begin{align}
\Delta\theta &\propto -\nabla_\theta J(\theta) \tag{3} \\
&= \left\langle \frac{\partial \log q(x;\theta)}{\partial \theta} \right\rangle_{p(x)}. \tag{4}
\end{align}

As can be seen in Figure 1a, the steepest gradient update steps can move
the parameters in a direction nearly perpendicular to the desired direction.
$q(x;\theta)$ is much more sensitive to changes in $\theta_1$ than $\theta_2$,
so the step size in $\theta_1$ should be much smaller, but is instead much
larger. In addition, $\theta_1$ and $\theta_2$ are not independent of each
other. They move the distribution in nearly the same direction, making movement
in the perpendicular direction particularly difficult. Getting the parameters
here to fully converge via steepest descent is a slow proposition, as shown in
Figure 1b.

The pathological learning gradient above is illustrative of a more general
problem. A model's learning gradient is affected by the parameterization of the
model as well as the objective function being minimized. The effects of the
parameterization can dominate learning. The natural gradient is a technique to
remove the effects of model parameterization from learning updates.



1.2 A metric on the parameter space 

As a first step towards compensating for differences in relative scaling and
cross-parameter dependencies, the shape of the parameter space $\theta$ is
described by assigning it a measure of distance, or a metric. This metric is
expressed via a symmetric matrix $G(\theta)$, which defines the length
$|d\theta|$ of an infinitesimal step $d\theta$ in the parameters,

$$|d\theta|^2 = \sum_{i,j} G_{ij}(\theta)\, d\theta_i\, d\theta_j = d\theta^T G(\theta)\, d\theta. \tag{5}$$

$G(\theta)$ is chosen so that the length $|d\theta|$ is representative of the
expected magnitude of the change in the objective function,
$J(\theta + d\theta) - J(\theta)$, resulting from a step $d\theta$. There is no
uniquely correct choice for $G(\theta)$.

If the objective function $J(\theta)$ is the negative log likelihood of a
probability distribution $q(x;\theta)$, then a measure of the information
distance between $q(x;\theta + d\theta)$ and $q(x;\theta)$ usually works well,
and the Fisher information matrix (Equation 30) is frequently used as a metric.
Plugging in the example from Section 1.1, the resulting Fisher information
matrix is

$$G = \begin{bmatrix} 3^2 + \frac{1}{3^2} & 1 \\ 1 & \frac{1}{3^2} \end{bmatrix}.$$
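The Fisher matrix quoted for the example can be checked numerically. The sketch below is not from the original: it assumes a unit-covariance Gaussian whose mean is a linear map `A @ theta` of the parameters, with a hypothetical `A` chosen so that `A.T @ A` has the entries 3² + 1/3², 1, 1 and 1/3². The names `score` and `fisher_monte_carlo` are illustrative.

```python
import numpy as np

# Assumption: q(x; theta) = N(A @ theta, I), with A picked so that A.T @ A
# matches the Fisher matrix quoted for the example of Section 1.1.
A = np.array([[3.0, 1.0 / 3.0],
              [1.0 / 3.0, 0.0]])

def score(x, theta):
    """Gradient of log q(x; theta) with respect to theta, one row per sample."""
    return (x - A @ theta) @ A

def fisher_monte_carlo(theta, n_samples=200_000, seed=0):
    """Estimate G_ij = <dlog q/dtheta_i * dlog q/dtheta_j> under q(x; theta)."""
    rng = np.random.default_rng(seed)
    x = A @ theta + rng.standard_normal((n_samples, 2))   # samples from the model
    s = score(x, theta)
    return s.T @ s / n_samples

theta = np.array([1.0, -1.0])
G_exact = A.T @ A                      # closed form for this linear-Gaussian model
G_estimate = fisher_monte_carlo(theta)
```

For this assumed model the metric does not depend on `theta`, which is the situation Section 1.4 starts from.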

1.3 Connection to covariance 

$G(\theta)$ is an analogue of the inverse covariance matrix $\Sigma^{-1}$. Just
as a signal can be whitened given $\Sigma^{-1}$ (removing all first order
dependencies and scaling the variance in each dimension to unit length), the
parameterization of $J(\theta)$ can also be "whitened," removing the
dependencies and differences in scaling between dimensions captured by
$G(\theta)$. See Figure 2 for an example of signal whitening.

Figure 1: (a) The parameter descent paths taken by steepest gradient descent
(red) and natural gradient descent (blue) for the example given in Section 1.1.
The parameters are initialized at $\theta_{init} = [1, -1]^T$, and are fit to
data generated with $\theta_{true} = [0, 0]^T$. The Fisher information matrix
(Equation 30) is used to calculate the natural gradient. Notice that steepest
descent takes a more circuitous and far slower path. (b) The KL divergence
between the data distribution and the fit model as a function of number of
gradient descent steps. Descent using the natural gradient converges more
quickly. (c) The arrows give the gradient of the log likelihood objective
(Equation 2), for a grid of parameter settings. This is the descent direction
provided by Equation 4. (d) The gradient of the same log likelihood objective
(Equation 2), but in terms of the whitened, natural, parameter space $\phi$ as
described in Section 1.4. Note that steepest descent in the whitened space
converges directly to the true parameter values.

Figure 2: Example of signal whitening. (a) Samples $x$ from an unwhitened
distribution in 2 variables. (b) The same samples after whitening, in new
variables $y = Wx = \Sigma^{-\frac{1}{2}} x$.

As a quick review, the covariance matrix $\Sigma$ of a zero-mean signal $x$ is
defined as

$$\Sigma = \langle x x^T \rangle. \tag{6}$$

The inverse covariance matrix is frequently used as a metric on the signal $x$.
This is called the Mahalanobis distance. It has the same form as the definition
of $|d\theta|^2$ in Equation 5,

$$|dx|^2_{Mahalanobis} = dx^T \Sigma^{-1} dx. \tag{7}$$

In order to whiten a signal $x$, a whitening matrix $W$ is found such that the
covariance matrix for a new signal $y = Wx$ is the identity matrix $I$. The
signal $y$ is then a whitened version of $x$,

$$I = \langle y y^T \rangle = W \langle x x^T \rangle W^T = W \Sigma W^T. \tag{8}$$

Remembering that $\Sigma^{-1}$ is symmetric, one solution¹ to this system of
linear equations is

\begin{align}
W &= \Sigma^{-\frac{1}{2}} \tag{9} \\
y &= \Sigma^{-\frac{1}{2}} x. \tag{10}
\end{align}

If the covariance matrix for $y$ is the identity, then the metric for the
Mahalanobis distance in the new variables $y$ is also the identity
($|dy|^2_{Mahalanobis} = dy^T dy$).

¹ Choosing $W = \Sigma^{-\frac{1}{2}}$ leads to symmetric, or zero-phase, whitening. In some
fields it is referred to as a decorrelation stretch. It is equivalent to rotating a signal to the
PCA basis, rescaling each axis to have unit variance, and then performing the inverse rotation,
returning the signal to its original orientation. All unitary transformations of
$\Sigma^{-\frac{1}{2}}$ also whiten the signal.






Whitening is a common preprocessing step in signal processing. It prevents
incidental differences in scaling between dimensions from affecting later
processing stages.
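Symmetric whitening can be sketched in a few lines of NumPy. The example below is illustrative rather than taken from the original: the mixing matrix, the sample count, and the helper name `symmetric_whitening_matrix` are assumptions, and the samples are explicitly centered so that the covariance can be computed as the average of $x x^T$.

```python
import numpy as np

def symmetric_whitening_matrix(x):
    """Return W = Sigma^{-1/2} for zero-mean samples x of shape (n_samples, n_dims)."""
    sigma = x.T @ x / x.shape[0]                 # Sigma = <x x^T>
    eigvals, eigvecs = np.linalg.eigh(sigma)     # Sigma = V diag(eigvals) V^T
    return eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T

rng = np.random.default_rng(0)
mixing = np.array([[2.0, 1.5],
                   [0.0, 0.5]])                  # hypothetical mixing, correlates the dims
x = rng.standard_normal((5000, 2)) @ mixing.T
x -= x.mean(axis=0)                              # enforce the zero-mean assumption

W = symmetric_whitening_matrix(x)
y = x @ W.T                                      # y = W x, applied to every sample
cov_y = y.T @ y / y.shape[0]                     # sample covariance of the whitened signal
```

Because `W` is built from the sample covariance itself, `cov_y` equals the identity up to floating point error, and `W` is symmetric as the zero-phase solution requires.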



1.4 "Whitening" the parameter space 

If $G$ is not a function of $\theta$, then a similar procedure can be followed
to produce a "whitened" parameterization $\phi$. We wish to find new parameters
$\phi = W\theta$ such that the metric $G$ on $\phi$ is the identity $I$, as the
Mahalanobis metric $\Sigma^{-1}$ is the identity for a whitened signal. This
will mean that a small step $d\phi$ in any direction will tend to have the same
magnitude effect on the objective $J(\phi)$. Requiring
$|d\theta|^2 = d\theta^T G\, d\theta$ to equal
$d\phi^T d\phi = d\theta^T W^T W\, d\theta$ for every step $d\theta$ gives
$G = W^T W$. Since $G$ is symmetric, one solution to this system of linear
equations is

\begin{align}
W &= G^{\frac{1}{2}} \tag{17} \\
\phi &= G^{\frac{1}{2}} \theta. \tag{18}
\end{align}

Steepest gradient descent steps in terms of $\phi$ descend the objective
function in a more direct fashion than steepest gradient descent steps in terms
of $\theta$, as is illustrated in Figure 1c and 1d. In $\phi$, the steepest
gradient is the natural gradient.

$G$ is almost always a function of $\theta$, and for most problems there is no
parameterization which will be "white" everywhere. So long as $G(\theta)$
changes slowly though, it can be treated as constant for a single learning
step². This suggests the following as an algorithm for learning in a natural
parameter space.

1. Express $J(\cdot)$ in terms of natural parameters
   $\phi = G^{\frac{1}{2}}(\theta_t)\, \theta$.

2. Calculate an update step $\Delta\phi \propto -\nabla_\phi J(\phi_t)$, where
   $\phi_t = G^{\frac{1}{2}}(\theta_t)\, \theta_t$.

3. Calculate the $\theta_{t+1} = G^{-\frac{1}{2}}(\theta_t)\,(\phi_t + \Delta\phi)$
   associated with the update to $\phi$.

4. Repeat.

The resulting update steps more directly and rapidly descend the objective
function than steepest descent steps.
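The four steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the model is a hypothetical unit-covariance Gaussian with mean `A @ theta`, the data are assumed to have zero mean (so the gradient is `A.T @ A @ theta`), and the helper names are invented for the example.

```python
import numpy as np

# Assumption: unit-covariance Gaussian with mean A @ theta, data with <x> = 0.
A = np.array([[3.0, 1.0 / 3.0],
              [1.0 / 3.0, 0.0]])

def grad_J(theta):
    """Gradient of the negative log likelihood with respect to theta."""
    return A.T @ (A @ theta)

def metric(theta):
    """Fisher information for this assumed model (constant in theta here)."""
    return A.T @ A

def sqrt_and_inverse_sqrt(G):
    """Symmetric G^{1/2} and G^{-1/2} from an eigendecomposition."""
    eigvals, eigvecs = np.linalg.eigh(G)
    root = eigvecs @ np.diag(np.sqrt(eigvals)) @ eigvecs.T
    inv_root = eigvecs @ np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T
    return root, inv_root

def natural_space_step(theta, step_size):
    G_half, G_neg_half = sqrt_and_inverse_sqrt(metric(theta))
    phi = G_half @ theta                     # steps 1 and 2: phi_t = G^{1/2}(theta_t) theta_t
    grad_phi = G_neg_half @ grad_J(theta)    # gradient with respect to phi (chain rule)
    phi_new = phi - step_size * grad_phi     # step 2: steepest descent update in phi
    return G_neg_half @ phi_new              # step 3: map the update back to theta

theta = np.array([1.0, -1.0])
for _ in range(20):                          # step 4: repeat
    theta = natural_space_step(theta, step_size=0.5)
```

Each call recomputes the metric at the current parameters, which matches treating $G(\theta)$ as constant for one learning step only.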



² Practically, $G(\theta)$ can usually be treated as constant for many learning steps. This
allows the natural gradient to be combined in a plug and play fashion with other gradient
descent algorithms, like L-BFGS, by performing gradient descent on $J(\phi)$ rather than
$J(\theta)$.

1.5 The natural gradient in $\theta$



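The parameter updates in Section 1.4 can be performed entirely in the original
parameter space $\theta$. The natural gradient $\tilde{\nabla}_\theta J(\theta)$
is the direction in $\theta$ which is equivalent to steepest gradient descent in
$\phi$ of $J(\phi)$. In order to find $\tilde{\nabla}_\theta J(\theta)$, we
first write $\Delta\phi$ in terms of $\theta$, then we write the natural
gradient update step in $\theta$, $\Delta\theta$, in terms of $\Delta\phi$,

\begin{align}
\Delta\phi &\propto -\nabla_\phi J(\phi) \tag{19} \\
&= -\left( \frac{\partial \theta}{\partial \phi} \right)^T \nabla_\theta J(\theta) \tag{20} \\
&= -G^{-\frac{1}{2}} \nabla_\theta J(\theta) \tag{21}
\end{align}

(where $\frac{\partial \theta}{\partial \phi}$ is the Jacobian matrix),

\begin{align}
\Delta\theta &= \frac{\partial \theta}{\partial \phi} \Delta\phi \tag{22} \\
&= G^{-\frac{1}{2}} \Delta\phi \tag{23} \\
&\propto -G^{-1} \nabla_\theta J(\theta). \tag{24}
\end{align}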



Since the natural gradient update step is proportional to the negative natural
gradient, $\Delta\theta \propto -\tilde{\nabla}_\theta J(\theta)$, the natural
gradient can be written as

$$\tilde{\nabla}_\theta J(\theta) = G^{-1}(\theta)\, \nabla_\theta J(\theta). \tag{25}$$

Figure 1a illustrates this gradient applied to the example objective function
from Section 1.1. If gradient descent is performed by infinitesimal steps in the
direction indicated by $\tilde{\nabla}_\theta J(\theta)$, then the
parameterization of the problem will have no effect on the path taken during
learning (though choice of $G(\theta)$ will have an effect).
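The independence from parameterization can be checked numerically. The sketch below assumes a toy quadratic objective with a constant metric `G` and a hypothetical linear reparameterization `theta_prime = B @ theta`; both are illustrative choices and not part of the original text. The natural gradient step computed in the new coordinates and mapped back agrees with the step computed directly in $\theta$, while the steepest descent step does not.

```python
import numpy as np

# Assumed toy objective J(theta) = 0.5 * theta^T G theta with a constant metric G.
G = np.array([[9.0 + 1.0 / 9.0, 1.0],
              [1.0, 1.0 / 9.0]])

def grad_J(theta):
    return G @ theta

# Hypothetical reparameterization theta_prime = B @ theta (any invertible B would do).
B = np.array([[2.0, 0.0],
              [1.0, 0.5]])
B_inv = np.linalg.inv(B)

theta = np.array([1.0, -1.0])

# The same quantities expressed in the new coordinates.
grad_prime = B_inv.T @ grad_J(theta)      # chain rule for the gradient
G_prime = B_inv.T @ G @ B_inv             # the metric transforms as a quadratic form

# Natural gradient directions, both expressed back in theta coordinates.
nat_step_theta = np.linalg.solve(G, grad_J(theta))
nat_step_from_prime = B_inv @ np.linalg.solve(G_prime, grad_prime)

# Steepest descent directions, compared the same way.
steep_step_theta = grad_J(theta)
steep_step_from_prime = B_inv @ grad_prime
```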



2 Recipes and tricks 

In this section we present a reference with key formulas for using the natural 
gradient, as well as approaches useful for applying the natural gradient in specific 
cases. 

2.1 Natural gradient 

The natural gradient is

$$\tilde{\nabla}_\theta J(\theta) = G^{-1}(\theta)\, \nabla_\theta J(\theta) \tag{26}$$

where $J(\theta)$ is an objective function to be minimized with parameters
$\theta$, and $G(\theta)$ is a metric on the parameter space. Learning should be
performed with an update rule

\begin{align}
\theta_{t+1} &= \theta_t + \Delta\theta_t \tag{27} \\
\Delta\theta_t &\propto -\tilde{\nabla}_\theta J(\theta_t) \tag{28}
\end{align}

with steps taken in the direction given by the natural gradient.
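The update rule can be sketched as a short loop. The example is a hedged illustration: it assumes a unit-covariance Gaussian model with mean `A @ theta`, data generated at `theta_true = [0, 0]`, and arbitrarily chosen step sizes and step counts. None of these specifics come from the original, and the comparison is only meant to show the qualitative behaviour.

```python
import numpy as np

A = np.array([[3.0, 1.0 / 3.0],
              [1.0 / 3.0, 0.0]])     # assumed map from parameters to the Gaussian mean
G = A.T @ A                          # Fisher information for this assumed model

def grad_J(theta):
    """Gradient of J for data with <x> = 0, i.e. theta_true = [0, 0]."""
    return A.T @ (A @ theta)

def kl_to_data(theta):
    """KL divergence between N(0, I) and N(A theta, I)."""
    return 0.5 * np.sum((A @ theta) ** 2)

def descend(theta_init, step_size, n_steps, natural):
    theta = np.array(theta_init, dtype=float)
    for _ in range(n_steps):
        direction = grad_J(theta)
        if natural:
            # G^{-1} grad, computed with a linear solve rather than an explicit inverse
            direction = np.linalg.solve(G, direction)
        theta = theta - step_size * direction
    return theta

theta_init = [1.0, -1.0]
theta_steepest = descend(theta_init, step_size=0.1, n_steps=100, natural=False)
theta_natural = descend(theta_init, step_size=0.1, n_steps=100, natural=True)
print(kl_to_data(theta_steepest), kl_to_data(theta_natural))
```

Using `np.linalg.solve` avoids forming $G^{-1}$ explicitly, which is usually the more stable choice when $G$ is poorly conditioned.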

2.2 Metric $G(\theta)$

If the objective function $J(\theta)$ is the negative log likelihood of a
probabilistic model $q(x;\theta)$ under an observed data distribution $p(x)$

$$J(\theta) = -\langle \log q(x;\theta) \rangle_{p(x)} \tag{29}$$

then the Fisher information matrix

$$G_{ij}(\theta) = \left\langle \frac{\partial \log q(x;\theta)}{\partial \theta_i} \frac{\partial \log q(x;\theta)}{\partial \theta_j} \right\rangle_{q(x;\theta)} \tag{30}$$

is a good metric to use.

If the objective function is not of the form given in Equation 29 and
cannot be transformed into that form, then greater creativity is required. See
Section 2.8 for some basic hints.

Remember, as will be discussed in Section 2.10, even if the metric you choose
is approximate, it is still likely to accelerate convergence!

2.3 Fisher information over data distribution 

The Fisher information matrix (Equation 30) requires averaging over the model
distribution $q(x;\theta)$. For some models this is very difficult to do. If
that is the case, instead taking the average over the empirical data
distribution $p(x)$

$$G_{ij}(\theta) = \left\langle \frac{\partial \log q(x;\theta)}{\partial \theta_i} \frac{\partial \log q(x;\theta)}{\partial \theta_j} \right\rangle_{p(x)} \tag{31}$$

is frequently an effective alternative.
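Averaging over the data amounts to an average of outer products of per-sample score vectors. The sketch below assumes the same hypothetical linear-Gaussian model used for illustration elsewhere (mean `A @ theta`, unit covariance) and synthetic data drawn at `theta_true = [0, 0]`; the function names are invented for the example.

```python
import numpy as np

A = np.array([[3.0, 1.0 / 3.0],
              [1.0 / 3.0, 0.0]])     # assumed map from parameters to the Gaussian mean

def per_sample_scores(x, theta):
    """d log q(x; theta) / d theta for each row of x, shape (n_samples, n_params)."""
    return (x - A @ theta) @ A

def empirical_fisher(scores):
    """Average outer product of the score over the data samples (Equation 31)."""
    return scores.T @ scores / scores.shape[0]

rng = np.random.default_rng(0)
data = rng.standard_normal((100_000, 2))     # samples from p(x), with theta_true = [0, 0]

G_model = A.T @ A                            # Equation 30 for this model
G_data_at_truth = empirical_fisher(per_sample_scores(data, np.zeros(2)))
G_data_far_away = empirical_fisher(per_sample_scores(data, np.array([1.0, -1.0])))
```

In this toy setting the two averages agree when the model matches the data and differ when it does not, so the data-averaged matrix is best read as an approximation to Equation 30.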

2.4 Energy approximation 

Parameter estimation in a probabilistic model of the form

$$q(x) = \frac{e^{-E(x;\theta)}}{Z(\theta)} \tag{32}$$

is in general very difficult, since it requires working with the frequently
intractable partition function integral
$Z(\theta) = \int e^{-E(x;\theta)} dx$. There are a number of techniques which
can provide approximate learning gradients (e.g. minimum probability flow
[Sohl-Dickstein et al., 2011b, Sohl-Dickstein et al., 2011a], contrastive
divergence [Welling and Hinton, 2002, Hinton, 2002], score matching
[Hyvärinen, 2005], mean field theory, and variational Bayes [Tanaka, 1998,
Kappen and Rodriguez, 1997, Jaakkola and Jordan, 1997, Haykin, 2008]). Turning
those gradients into natural gradients is difficult though, as the Fisher
information depends on the gradient of $\log Z(\theta)$. Practically, simply
ignoring the $\log Z(\theta)$ terms entirely and using a metric

$$G_{ij}(\theta) = \left\langle \frac{\partial E(x;\theta)}{\partial \theta_i} \frac{\partial E(x;\theta)}{\partial \theta_j} \right\rangle_{p(x)} \tag{33}$$

averaged over the data distribution works surprisingly well, and frequently
greatly accelerates learning.
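A metric built only from energy gradients is cheap to compute from data. The sketch below uses a made-up one-dimensional energy, `E(x; theta) = theta[0] * x**2 + theta[1] * x**4`, purely as an illustration; the energy function, the stand-in data, and the function names are all assumptions.

```python
import numpy as np

def energy_gradients(x):
    """dE/dtheta for the hypothetical energy E(x; theta) = theta[0]*x**2 + theta[1]*x**4."""
    return np.stack([x ** 2, x ** 4], axis=1)

def energy_metric(x):
    """Average outer product of energy gradients over data samples (log Z terms ignored)."""
    g = energy_gradients(x)
    return g.T @ g / g.shape[0]

rng = np.random.default_rng(0)
data = rng.standard_normal(10_000)     # stand-in for observed one-dimensional data
G_energy = energy_metric(data)
```

No samples from the model and no estimate of the partition function are needed, which is the appeal of the approximation.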



2.5 Diagonal approximation 

$G(\theta)$ is a square matrix of size $N \times N$, where $N$ is the number of
parameters in the vector $\theta$. For problems with large $N$,
$G^{-1}(\theta)$ can be impractically expensive to compute and apply. For almost
all problems however, the natural gradient still improves convergence even when
off-diagonal elements of $G(\theta)$ are neglected,

$$G_{ij}(\theta) = \delta_{ij} \left\langle \left( \frac{\partial \log q(x;\theta)}{\partial \theta_i} \right)^2 \right\rangle_{q(x;\theta)} \tag{34}$$

making inversion and application cost $O(N)$ to perform.

If the parameters can be divided up into several distinct classes (for instance
the covariance matrix and means of a Gaussian distribution), block diagonal
forms may also be worth considering.
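With a diagonal metric the natural gradient reduces to an elementwise division. The sketch below assumes per-sample score vectors are already available as an array; the synthetic `scores`, the small constant `eps`, and the function names are illustrative assumptions rather than part of the original recipe.

```python
import numpy as np

def diagonal_metric(scores):
    """Keep only G_ii = <(dlog q/dtheta_i)^2>, returned as a vector of length N."""
    return np.mean(scores ** 2, axis=0)

def diagonal_natural_gradient(grad, scores, eps=1e-8):
    """Elementwise division replaces the matrix solve; eps (an assumption) guards against zeros."""
    return grad / (diagonal_metric(scores) + eps)

rng = np.random.default_rng(0)
# Made-up scores whose scale differs strongly between parameters.
scores = rng.standard_normal((1000, 5)) * np.array([10.0, 3.0, 1.0, 0.3, 0.1])
grad = np.ones(5)
nat_grad = diagonal_natural_gradient(grad, scores)
```

Parameters to which the model is very sensitive (large entries of the diagonal) receive proportionally smaller steps.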



2.6 Regularization 

Even if evaluating the full $G$ is easy for your problem, you may still find
that $G^{-1}$ is ill-conditioned³. Dealing with this (solving a set of linear
equations subject to some regularization, rather than using an unstable matrix
inverse) is an entire field of study in computer science. Here we give one
simple plug and play technique, called stochastic robust approximation (Section
6.4.1 in [Boyd and Vandenberghe, 2004]), for regularizing the matrix inverse. If
$G^{-1}$ is replaced with

$$G^{-1}_{reg} = \left( G^T G + \epsilon I \right)^{-1} G^T \tag{35}$$

where $\epsilon$ is some small constant (say 0.01), the matrix inverse will be
much better behaved.

³ This is a general problem when taking matrix inverses. A matrix $A$ with random elements,
or with noisy elements, will tend to have a few very small eigenvalues. The eigenvalues
of $A^{-1}$ are the inverses of the eigenvalues of $A$. $A^{-1}$ will thus tend to have a few
very large eigenvalues, which will tend to make the elements of $A^{-1}$ very large. Even
worse, the eigenvalues and eigenvectors which most dominate $A^{-1}$ are those which were
smallest, noisiest and least trustworthy in $A$.

Alternatively, techniques such as ridge regression can be used to solve the
linear equation

$$G(\theta)\, \tilde{\nabla}_\theta J(\theta) = \nabla_\theta J(\theta) \tag{36}$$

for $\tilde{\nabla}_\theta J(\theta)$.
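The regularized inverse can be applied with a linear solve instead of an explicit inverse. The sketch below uses a deliberately ill-conditioned, made-up `G` and gradient; the function names are hypothetical. Applying Equation 35 to a gradient is the same as the ridge-regularized least squares solution of Equation 36.

```python
import numpy as np

def regularized_inverse(G, eps=0.01):
    """(G^T G + eps I)^{-1} G^T from Equation 35, computed with a linear solve."""
    n = G.shape[0]
    return np.linalg.solve(G.T @ G + eps * np.eye(n), G.T)

def regularized_natural_gradient(G, grad, eps=0.01):
    """Ridge-regularized least squares solution v of G v = grad (Equation 36)."""
    return regularized_inverse(G, eps) @ grad

# A nearly singular metric, chosen only to make the instability visible.
G = np.array([[1.0, 1.0],
              [1.0, 1.0 + 1e-8]])
grad = np.array([1.0, -1.0])

plain = np.linalg.solve(G, grad)                     # unregularized G^{-1} grad
regularized = regularized_natural_gradient(G, grad)  # regularized alternative
```

The unregularized solution is dominated by the tiny eigenvalue of `G`, while the regularized one stays bounded: for symmetric `G` each eigenvalue $\lambda$ is mapped to $\lambda / (\lambda^2 + \epsilon)$, which never exceeds $1 / (2\sqrt{\epsilon})$.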

2.7 Combining the natural gradient with other techniques using the natural parameter space

It can be useful to combine the natural gradient with other gradient descent
techniques. Blindly replacing all gradients with natural gradients frequently
causes problems (line search implementations, for instance, depend on the
gradients they are passed being the true gradients of the function they are
descending). For a fixed value of $G$, though, there is a natural parameter
space

$$\phi = G^{\frac{1}{2}}(\theta_{fixed})\, \theta \tag{37}$$

in which the steepest gradient is the same as the natural gradient.

In order to easily combine the natural gradient with other gradient descent
techniques, fix $\theta_{fixed}$ to the initial value of $\theta$ and perform
gradient descent over $\phi$ using any preferred algorithm. After a significant
number of update steps, convert back to $\theta$, update $\theta_{fixed}$ to
the new value of $\theta$, and continue gradient descent in the new $\phi$
space.
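As an illustration of handing the natural parameter space to a generic optimizer, the sketch below runs plain gradient descent with a backtracking line search on $J(\phi)$. The line search stands in for "any preferred algorithm"; the toy linear-Gaussian model, the Armijo constant, and all function names are assumptions made for the example.

```python
import numpy as np

# Assumption: the toy model with mean A @ theta, for which J(theta) equals the KL
# divergence 0.5 * |A theta|^2 when the data have theta_true = [0, 0].
A = np.array([[3.0, 1.0 / 3.0],
              [1.0 / 3.0, 0.0]])
G_fixed = A.T @ A                      # metric evaluated once, at theta_fixed

def J(theta):
    return 0.5 * np.sum((A @ theta) ** 2)

def grad_J(theta):
    return A.T @ (A @ theta)

eigvals, eigvecs = np.linalg.eigh(G_fixed)
G_half = eigvecs @ np.diag(np.sqrt(eigvals)) @ eigvecs.T
G_neg_half = eigvecs @ np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T

def J_phi(phi):
    return J(G_neg_half @ phi)

def grad_J_phi(phi):
    # True gradient of J_phi, so a line search can rely on it.
    return G_neg_half @ grad_J(G_neg_half @ phi)

def backtracking_descent(phi, n_steps, shrink=0.5, armijo=1e-4):
    for _ in range(n_steps):
        g = grad_J_phi(phi)
        step = 1.0
        for _ in range(50):            # cap the number of step halvings
            if J_phi(phi - step * g) <= J_phi(phi) - armijo * step * (g @ g):
                break
            step *= shrink
        phi = phi - step * g
    return phi

theta_fixed = np.array([1.0, -1.0])
phi_final = backtracking_descent(G_half @ theta_fixed, n_steps=5)
theta_final = G_neg_half @ phi_final   # convert back to theta
```

In a problem where the metric varies with $\theta$, the conversion back to `theta` would be followed by recomputing `G_fixed` at the new parameters before continuing.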



2.8 Natural gradient of non-probabilistic models 

The techniques presented here are not unique to probabilistic models. The
natural gradient can be used in any context where a suitable metric can be
written for the parameters. There are several approaches to writing an
appropriate metric.

1. If the objective function is of a form

   $$J(\theta) = \langle l(x;\theta) \rangle_{p(x)} \tag{38}$$

   where $\langle \cdot \rangle_{p(x)}$ indicates averaging over some data
   distribution $p(x)$, then it is sensible to choose a metric based on

   $$G_{ij}(\theta) = \left\langle \frac{\partial l(x;\theta)}{\partial \theta_i} \frac{\partial l(x;\theta)}{\partial \theta_j} \right\rangle_{p(x)}. \tag{39}$$

2. Similarly, the penalty function can be treated as if it is the negative log
   likelihood of a probabilistic model, and the corresponding Fisher
   information matrix used.

   For example, the task of minimizing an L2 penalty function
   $\|y - f(x;\theta)\|^2$ over observed pairs of data $p(x, y)$ can be made
   probabilistic. Imagine that the L2 penalty instead represents a conditional
   Gaussian $q(y|x;\theta) \propto \exp\left( -\|y - f(x;\theta)\|^2 \right)$
   over $y$, and use the observed marginal $p(x)$ over $x$⁴ to build a joint
   distribution $q(x, y;\theta) = q(y|x;\theta)\, p(x)$. This generates the
   metric:

   \begin{align}
   G_{ij}(\theta) &= \left\langle \frac{\partial \log \left[ q(y|x;\theta)\, p(x) \right]}{\partial \theta_i} \frac{\partial \log \left[ q(y|x;\theta)\, p(x) \right]}{\partial \theta_j} \right\rangle_{q(y|x;\theta)\, p(x)} \tag{40} \\
   &= \left\langle \frac{\partial \log q(y|x;\theta)}{\partial \theta_i} \frac{\partial \log q(y|x;\theta)}{\partial \theta_j} \right\rangle_{q(y|x;\theta)\, p(x)}. \tag{41}
   \end{align}

3. Find a set of parameter transformations $T(\theta)$ which you believe the
   distance measure $|d\theta|$ should be invariant to, and then find a metric
   $G(\theta)$ such that this invariance holds. That is, find $G(\theta)$ such
   that the following relationship holds for any invariant transformation
   $T(\theta)$,

   $$|(\theta + d\theta) - \theta|^2 = |T(\theta + d\theta) - T(\theta)|^2. \tag{42}$$

   A special case of this approach involves functions parametrized by a
   matrix, as presented in the next section.
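For the L2 example in item 2, the average over $y$ can be carried out by hand. The short derivation below is a sketch under the stated assumption that $q(y|x;\theta) \propto \exp(-\|y - f(x;\theta)\|^2)$, which is a Gaussian in $y$ with mean $f(x;\theta)$ and covariance $\frac{1}{2} I$. The residual symbol $r = y - f(x;\theta)$ is introduced here for convenience.

```latex
\begin{align}
\frac{\partial \log q(y|x;\theta)}{\partial \theta_i}
  &= 2\, r^T \frac{\partial f(x;\theta)}{\partial \theta_i},
  \qquad \langle r r^T \rangle_{q(y|x;\theta)} = \tfrac{1}{2} I, \\
G_{ij}(\theta)
  &= 4 \left\langle \frac{\partial f(x;\theta)}{\partial \theta_i}^T
       \langle r r^T \rangle_{q(y|x;\theta)}
       \frac{\partial f(x;\theta)}{\partial \theta_j} \right\rangle_{p(x)}
   = 2 \left\langle \frac{\partial f(x;\theta)}{\partial \theta_i}^T
       \frac{\partial f(x;\theta)}{\partial \theta_j} \right\rangle_{p(x)} .
\end{align}
```

Under this assumption the metric depends only on the Jacobian of $f$ averaged over the observed inputs, and never on the targets $y$.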



2.9 $W^T W$

As derived in [Amari, 1998], if a function depends on a (square, non-singular)
matrix $W$, it frequently aids learning a great deal to take

$$\Delta W_{nat} \propto -\frac{\partial J}{\partial W} W^T W. \tag{43}$$

The algebra leading to this rule is complex, but as discussed in the previous
section it falls out of a demand that the distance measure $|dW|$ be invariant
to a set of transformations applied to $W$. In this case, those transformations
are right multiplication by any (non-singular) matrix $Y$,

$$d\theta^T G(\theta)\, d\theta = (d\theta\, Y)^T G(\theta Y)\, (d\theta\, Y). \tag{44}$$
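The invariance to right multiplication can be checked numerically. In the sketch below the matrices `W`, `Y` and the stand-in gradient `dJ_dW` are random and purely illustrative, and the update is written with an explicit minus sign for descent. Reparameterizing with `V = W @ Y` changes the natural update in exactly the way `W` itself changes, whereas the plain gradient step does not transform this way.

```python
import numpy as np

def natural_step_matrix(dJ_dW, W):
    """Natural gradient descent direction for a matrix parameter: -(dJ/dW) W^T W."""
    return -dJ_dW @ W.T @ W

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 3)) + 3.0 * np.eye(3)    # hypothetical non-singular matrices
Y = rng.standard_normal((3, 3)) + 3.0 * np.eye(3)
dJ_dW = rng.standard_normal((3, 3))                  # stand-in for a gradient dJ/dW

# Reparameterize with V = W Y, so that J depends on V through W = V Y^{-1}.
V = W @ Y
dJ_dV = dJ_dW @ np.linalg.inv(Y).T                   # chain rule

step_W = natural_step_matrix(dJ_dW, W)
step_V = natural_step_matrix(dJ_dV, V)
```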



⁴ Amari [Amari, 1998] suggests using some uninformative model distribution $q(x)$ over
the inputs, such as a Gaussian distribution, rather than taking $p(x)$ from the data. Either
approach will likely work well.

2.10 What if my approximation of $\Delta\theta_{nat}$ is wrong?

For any positive definite $H$, movement in a direction

$$\Delta\tilde{\theta} = H \Delta\theta \tag{45}$$

will descend the objective function, where $\Delta\theta$ is the steepest
gradient descent step. If the wrong $H$ is used, gradient descent is performed
in a suboptimal way, which is the problem when steepest gradient descent is
used as well. Making an educated guess as to $H$ rarely makes things worse, and
frequently helps a great deal.
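The descent property is easy to verify numerically. In the sketch below the gradient and the positive definite matrix `H` are random stand-ins, not quantities from the original. The first-order change in the objective along the modified step is $-\nabla J^T H \nabla J$, which is negative whenever the gradient is non-zero.

```python
import numpy as np

rng = np.random.default_rng(0)
grad = rng.standard_normal(4)             # stand-in for the gradient of J at the current theta
steepest_step = -grad                     # the steepest descent step

M = rng.standard_normal((4, 4))
H = M @ M.T + 1e-3 * np.eye(4)            # an arbitrary positive definite guess at G^{-1}

modified_step = H @ steepest_step         # Equation 45
directional_derivative = grad @ modified_step   # first-order change in J along the step
```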